Q-value Heuristics for Approximate Solutions of Dec-POMDPs
Authors
Abstract
The Dec-POMDP is a model for multi-agent planning under uncertainty that has received increasing attention in recent years. In this work we propose a new heuristic, QBG, that can be used in various algorithms for Dec-POMDPs, and we describe its differences from and similarities with QMDP and QPOMDP. An experimental evaluation shows that, at the price of some extra computation, QBG gives a consistently tighter upper bound on the maximum obtainable value.

Introduction

In recent years the artificial intelligence (AI) community has shown an increasing interest in multi-agent systems (MAS), thereby narrowing the gap between game-theoretic and decision-theoretic reasoning. Especially popular are frameworks based on Markov decision processes (MDPs) (Puterman 1994). In this paper we focus on the decentralized partially observable Markov decision process (Dec-POMDP), a variant for multi-agent (decentralized) planning in stochastic environments that can only be partially observed. Example application fields for Dec-POMDPs are cooperative robotics, distributed sensor networks and communication networks. Two specific examples are by Emery-Montemerlo et al. (2004), who considered multi-robot navigation in which a team of agents with noisy sensors has to act to find/capture a goal, and Becker et al. (2004), who introduced a multi-robot space exploration example in which the agents (Mars rovers) have to decide how to proceed with their mission.

Unfortunately, optimally solving Dec-POMDPs is provably intractable (Bernstein et al. 2002), leading to the need for smart approximate methods and good heuristics. In this work we focus on the latter, presenting a taxonomy of heuristics for Dec-POMDPs. We also introduce a new heuristic, dubbed QBG, which is based on Bayesian games (BGs) and therefore, contrary to the other described heuristics, takes into account some level of decentralization. The mentioned heuristics could be used by different methods, but we particularly focus on the approach by Emery-Montemerlo et al. (2004), because it gives a very natural introduction to the application of BGs in Dec-POMDPs.

This paper is organized as follows. First, the Dec-POMDP is formally introduced. Next, we describe how heuristics can be used to find approximate policies using BGs. Different heuristics, including the new QBG heuristic, are then placed in a taxonomy. Before we conclude with a discussion, we present a preliminary experimental evaluation of the different heuristics.

The Dec-POMDP framework

Definition 1. A decentralized partially observable Markov decision process (Dec-POMDP) with m agents is defined as a tuple 〈S, A, T, R, O, O〉, where:
• S is a finite set of states.
• A = ×_i A_i is the set of joint actions, where A_i is the set of actions available to agent i. Every time step, one joint action a = 〈a_1, ..., a_m〉 is taken. Agents do not observe each other's actions.
• T is the transition function, a mapping from states and joint actions to probability distributions over states: T : S × A → P(S), where P(X) denotes the (infinite) set of probability distributions over a finite set X.
• R is the immediate reward function, a mapping from states and joint actions to real numbers: R : S × A → ℝ.
• O = ×_i O_i is the set of joint observations, where O_i is a finite set of observations available to agent i. Every time step, one joint observation o = 〈o_1, ..., o_m〉 is received, from which each agent i observes only its own component o_i.
• O is the observation function, a mapping from joint actions and successor states to probability distributions over joint observations: O : A × S → P(O).
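To make Definition 1 concrete, the following is a minimal sketch, not taken from the paper, of how the components of the tuple could be stored in code; the class name, field names and the dictionary-based encodings of T, R and the observation function are illustrative assumptions (the horizon h and initial belief b, introduced just below, are included for completeness).

from dataclasses import dataclass
from typing import Dict, List, Tuple

JointAction = Tuple[str, ...]        # one component per agent, e.g. ('a1', 'a2')
JointObservation = Tuple[str, ...]   # one component per agent

@dataclass
class DecPOMDP:
    # S: finite set of states
    states: List[str]
    # A_i for each agent i; the joint action set A is their Cartesian product
    agent_actions: List[List[str]]
    # O_i for each agent i; the joint observation set O is their Cartesian product
    agent_observations: List[List[str]]
    # T(s, a): distribution over successor states, stored as {s_next: probability}
    transition: Dict[Tuple[str, JointAction], Dict[str, float]]
    # R(s, a): immediate reward
    reward: Dict[Tuple[str, JointAction], float]
    # O(a, s_next): distribution over joint observations, stored as {o: probability}
    observation: Dict[Tuple[JointAction, str], Dict[JointObservation, float]]
    # h: planning horizon, and b: initial state distribution at t = 0
    horizon: int
    initial_belief: Dict[str, float]

In such a representation the number of agents m is implicit in the length of agent_actions, and the distributions P(S) and P(O) are simply dictionaries mapping outcomes to probabilities.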
The planning problem is to find the best behavior, or an optimal policy, for each agent for a particular number of time steps h, also referred to as the horizon of the problem. Additionally, the problem is usually specified together with an initial 'belief' b ∈ P(S); this is the initial state distribution at time t = 0 (unless stated otherwise, all superscripts are time-step indices). The policies we are looking for are mappings from the histories the agents can observe to actions. Therefore we first formalize two types of histories.

Definition 2. We define the action-observation history for agent i, θ⃗_i^t, as the sequence of actions taken and observations received by agent i up to time step t:

θ⃗_i^t = ( a_i^0, o_i^1, a_i^1, ..., a_i^{t-1}, o_i^t ).    (1)

The joint action-observation history is the tuple containing the action-observation histories of all agents: θ⃗^t = 〈 θ⃗_1^t, ..., θ⃗_m^t 〉.
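As a small illustration of Definition 2 (not from the paper), the sketch below stores an individual action-observation history as a flat tuple that alternates actions and observations, builds a joint history, and shows a toy individual policy; all concrete action and observation names are made-up placeholders.

from typing import Dict, Tuple

# Alternating actions and observations: (a^0, o^1, a^1, ..., a^{t-1}, o^t)
History = Tuple[str, ...]

def extend(history: History, action: str, observation: str) -> History:
    # Append the action taken at step t and the observation received at step t+1.
    return history + (action, observation)

# Individual action-observation histories for two agents after two steps:
theta_1: History = ("act-A", "obs-X", "act-B", "obs-Y")   # theta_1^2
theta_2: History = ("act-A", "obs-Y", "act-A", "obs-X")   # theta_2^2

# The joint action-observation history is the tuple of the individual histories:
joint_theta = (theta_1, theta_2)                           # theta^2 = <theta_1^2, theta_2^2>

# A deterministic individual policy maps what agent i can observe to its next action;
# here it is keyed by the agent's own observation sequence.
policy_1: Dict[Tuple[str, ...], str] = {
    (): "act-A",
    ("obs-X",): "act-B",
    ("obs-Y",): "act-A",
}

Because each agent only ever sees its own observation component, each individual policy is conditioned on that agent's own history, which is what makes the planning problem decentralized.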
Similar resources
Optimal and Approximate Q-value Functions for Decentralized POMDPs
Decision-theoretic planning is a popular approach to sequential decision making problems, because it treats uncertainty in sensing and acting in a principled way. In single-agent frameworks like MDPs and POMDPs, planning can be carried out by resorting to Q-value functions: an optimal Q-value function Q is computed in a recursive manner by dynamic programming, and then an optimal policy is extr...
Point-based incremental pruning heuristic for solving finite-horizon DEC-POMDPs
Recent scaling up of decentralized partially observable Markov decision process (DEC-POMDP) solvers towards realistic applications is mainly due to approximate methods. Of this family, MEMORY BOUNDED DYNAMIC PROGRAMMING (MBDP), which combines in a suitable manner top-down heuristics and bottom-up value function updates, can solve DEC-POMDPs with large horizons. The performance of MBDP can be...
Bounded Dynamic Programming for Decentralized POMDPs
Solving decentralized POMDPs (DEC-POMDPs) optimally is a very hard problem. As a result, several approximate algorithms have been developed, but these do not have satisfactory error bounds. In this paper, we first discuss optimal dynamic programming and some approximate finite horizon DEC-POMDP algorithms. We then present a bounded dynamic programming algorithm. Given a problem and an error bou...
An Investigation into Mathematical Programming for Finite Horizon Decentralized POMDPs
Decentralized planning in uncertain environments is a complex task generally dealt with by using a decision-theoretic approach, mainly through the framework of Decentralized Partially Observable Markov Decision Processes (DEC-POMDPs). Although DEC-POMDPS are a general and powerful modeling tool, solving them is a task with an overwhelming complexity that can be doubly exponential. In this paper...
Approximate Solutions for Factored Dec-POMDPs with Many Agents — Extended Abstract
Dec-POMDPs are a powerful framework for planning in multiagent systems, but are provably intractable to solve. This paper proposes a factored forward-sweep policy computation method that tackles the stages of the problem one by one, exploiting weakly coupled structure at each of these stages. An empirical evaluation shows that the loss in solution quality due to these approximations is small an...
Publication date: 2007